Nothing new under the sun: Mediaeval line plot, circa 1010. Image: wikimedia
If you haven’t already completed the first data vis worksheet, do that now.
Facets
As we add layers our plots become more complex. We run into trade-offs between information density and clarity.
To give one example, this plot shows life expectancies for each country in the development data, plotted by year:
development %>%
ggplot(aes(year, life_expectancy, group=country)) +
geom_smooth(se=FALSE)Explanation: This is another x/y plot. However this time we have not added points, but rather smoothed lines (one for each country).
Explanation of the code:We have created an x/y plot
as before, but this time we only added geom_smooth (and not
geom_point), so we can’t see the individual datapoints. We
have also added the text group=country which means we see
one line per-country in the dataset. Finally, we also added
se=FALSE. This hides the shaded area which
geom_smooth adds by default.
Comment on the result: It’s pretty hard to read!
To increase the information density, and explore patterns within the data, we might add another dimension and aesthetic. The next plot colours the lines by continent:
development %>%
ggplot(aes(year, life_expectancy, colour=continent, group=country)) +
geom_smooth(se=FALSE)However, even with colours added it’s still a bit of a mess. We can’t see the differences between continents easily.
To clean things up we can use a technique called facetting.
development %>%
ggplot(aes(year, life_expectancy, group=country)) +
geom_smooth(se=FALSE) +
facet_grid(~continent)Explanation: We added the code
+ facet_grid(.~continent) to our previous plot, and removed
the color aesthetic. This made ggplot create individual
panels for each continent. Splitting the graph this way makes
it somewhat easier to compare the differences between
continents.
Try facetting yourself
Use the iris dataset, which is built into R.
Try to recreate this plot by adapting the code shown above
Create a second plot which uses colours to distinguish species and does not use Facets
In this example, which plot do you prefer? What influences when facets are more useful the just using colour?
iris %>%
ggplot(aes(Sepal.Length, Petal.Length)) +
geom_point() +
geom_smooth() +
facet_grid(~Species)iris %>%
ggplot(aes(Sepal.Length, Petal.Length, color=Species)) +
geom_point() +
geom_smooth() There’s no single right answer here, but for this example I prefer the coloured plot to the faceted one. The reason is that there are only 3 species in this dataset, and the points for each don’t overlap much. This means it is easy to distinguish them, even in the combined plot.
However, if there were many different species it might be helpful to use facets instead.
Our decisions should be driven by what we are trying to communicate with the plot. What was the research question that motivated us to draw it?
Adjusting the facets
In the explanation above, we saw this plot:
development %>%
ggplot(aes(year, life_expectancy, group=country)) +
geom_smooth(se=FALSE) +
facet_grid(~continent)Re-run the plot and make the following adaptations:
Try replacing
facet_grid(~continent)withfacet_grid(continent~.). What happens?Try replacing
facet_gridwithfacet_wrap(~continent). What happens?If you have time, to see more faceting examples, see the
ggplotcookbook documentation. Try to re-create some of these examples.
Comparing categories
In the examples above we have been plotting continuous variables (and adding colours etc). We’ve used density, scatter and smoothed line plots to do this.
Another common requirement is to use plots to compare summary statistics for different groups or categories. For example, the classic plot in a psychology study looks like this:
However, there is evidence that readers often misinterpret bar plots. Specifically, the problem is that we perceive values within the bar area as more likely than those just above, even though this is not in fact the case.
A better choice is (almost always) to use a boxplot:
expdata %>%
ggplot(aes(x=stimuli, y=RT)) +
geom_boxplot()Explanation: We used Condition, a
category, as our x axis, and reaction times as the y axis. We added
geom_boxplot to show a boxplot.
If you’re not familiar with boxplots, there are more details in the
help files (type ?geom_boxplot into the console) or use the
wikipedia page
here
Loading data over the internet
So far we have always used data from the psydata
package. However, we can also load data from files or over the internet
using the read_csv() function. This is an example:
expdata <- read_csv('https://gist.githubusercontent.com/benwhalley/f94baf447612e2434b181739dbba27df/raw/43df26022fff68f49918c795f27d7352dc0d3425/expdata.csv')
expdata %>% glimpseRows: 240
Columns: 4
$ Condition [3m[38;5;246m<chr>[39m[23m "A", "B", "C", "A", "B", "C", "A", "B", "C", "A", "B", "C", "A", "B", "C", "A"…
$ stimuli [3m[38;5;246m<chr>[39m[23m "S1", "S1", "S1", "S2", "S2", "S2", "S3", "S3", "S3", "S4", "S4", "S4", "S1", …
$ p [3m[38;5;246m<dbl>[39m[23m 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, …
$ RT [3m[38;5;246m<dbl>[39m[23m 151.84776, 313.68298, 556.61563, 53.63154, 560.36673, 586.03610, 504.98535, 27…
Explanation: The part which starts
expdata <- read_csv('https://... loads a CSV (data) file
over the internet, and stores it in the variable called
expdata. You can click this
link to see the raw data. We then use glimpse to check
the data
Load the (simulated) dataset from this url: https://gist.githubusercontent.com/benwhalley/f94baf447612e2434b181739dbba27df/raw/43df26022fff68f49918c795f27d7352dc0d3425/expdata.csv (note that this URL is very long — make sure you copy the whole of it).
- Recreate the boxplot above
- Use a facet to recreate the plot you saw above, combining both
ConditionandStimuli
expdata %>%
ggplot(aes(x=stimuli, y=RT)) +
geom_boxplot() +
facet_wrap(~Condition)Other data summaries
If you really need to plot the mean and standard error or
standard deviations of different categories, ggplot also has the
stat_summary command:
expdata %>%
ggplot(aes(Condition, RT)) +
stat_summary()Explanation: We used Condition and
RT as our x and y axes, as before. This time we added
stat_summary() instead of geom_boxplot(). By
default this plots the mean and standard error (a measure of
variability) in each group, using a point-range plot.
This is better than a bar chart because it avoids a known bias in how we
read them. You can ignore the warning about
No summary function supplied, defaulting to mean_se() for
now.
- Adapt your facetted boxplot from above to show the mean and standard error instead
expdata %>%
ggplot(aes(x=stimuli, y=RT)) +
stat_summary() +
facet_wrap(~Condition)Multiple summary layers
A “violin plot” is a cross between a boxplot and a density plot.
They are simple ot make in ggplot; we can simply replace
geom_boxplot() with geom_violin():
expdata %>%
ggplot(aes(x=stimuli, y=RT)) +
geom_violin() +
facet_wrap(~Condition)However many authors find it helpful to overlay additional summaries like the mean and standard error of a variable. For example:
Make a new plot, based on the code above, which combines
geom_violin() with stat_summary()
expdata %>%
ggplot(aes(x=stimuli, y=RT)) +
geom_violin() +
stat_summary() +
facet_wrap(~Condition)Decisions about grouping
Consider the following two plots. Both present the same data on the incomes of men and women who are either blind or normally sighted:
- Which plot do you prefer, and why?
I don’t have a strong overall preference between the plots, but I think plot A is probabably more useful. My reasoning is that in this plot we can see that:
- Men are paid more
- Blind people are paid less
- The gender-pay gap appears to be larger for blind people
I think that third point is slightly harder to read/understand from plot B.
The idea to take-away is that grouping the data in the plot in different ways can make it easier or more difficult to identify different patterns.
- Re-organize the plot below to make it easier to see which level of education has the highest gender pay gap.
earnings %>%
ggplot(aes(x=education, y=income/1000)) +
facet_grid(~gender) +
stat_summary() +
labs(y = "Income (£ 1000)", x="Gender")To reorganise the panels, you will need to amend the variables used to facet the plot.
earnings %>%
ggplot(aes(x=gender, y=income/1000)) +
facet_grid(~education) +
stat_summary() +
labs(y = "Income (£ 1000)", x="Gender")- Try adding
scale_y_log10()to your updated plot. How does this change the impression you have of the data? Which plot is the most “truthful”?
By adding a log scale to the gender pay gaps appear to be more equal across different education categories.
earnings %>%
ggplot(aes(x=gender, y=income/1000)) +
facet_grid(~education) +
stat_summary() +
scale_y_log10() +
labs(y = "Income (£ 1000)", x="Gender")Either plot can be considered more or less ‘truthful’, depending on your perspective:
If you believe a £2000 pay gap is the same for people earning around £20,000 as it is for those earning £100,000 then the first plot is more helpful. We can see there that largest pay gaps are among the most educated and so highest-earning.
If we consider that £2000 means less to those earning very high incomes, then the second plot is more helpful. This plot shows that, relative to total income, the size of the pay gap is pretty consistent across education categories.
However the elephant in the room is that income has a very skewed distribution:
earnings %>%
ggplot(aes(income)) + geom_density()This means that medians for men/women are very different to the means, and the median gender pay gap is different to the mean gender pay gap:
earnings %>%
group_by(blind, gender) %>%
summarise(m=mean(income)) %>%
group_by(blind) %>%
summarise(diff(m)) %>%
set_names(c("Vision", "Mean pay gap (£)")) %>%
pander()| Vision | Mean pay gap (£) |
|---|---|
| Blind | 20370 |
| Normally sighted | 8473 |
earnings %>%
group_by(blind, gender) %>%
summarise(m=median(income)) %>%
group_by(blind) %>%
summarise(diff(m)) %>%
set_names(c("Vision", "Median pay gap (£)")) %>%
pander()| Vision | Median pay gap (£) |
|---|---|
| Blind | 1735 |
| Normally sighted | 8222 |
The moral here is that different summaries of the data tell different stories: We should explore our data thoroughly, and avoid jumping to conclusions!
More practice
Scenario 1: Secret agent
You are an MI6 agent, and have been sent a mystery dataset by one of your spies. She said it contains highly important information which will be of great interest to your superiors.
- Use your ggplot wizardry (and techniques we have covered today) to recover this classified information and present it in a single graphic for Q.
You can load the data using a new command, read_csv().
This command accepts either the name of a file, or a URL that the data
is stored on the web.
The file URL is: https://gist.githubusercontent.com/benwhalley/2103941f0c2ac0449247f59fe9c976b6/raw/4630367728610f0a9937d71eeac6fb511065ee46/mystery.csv
To load the data, you might write:
mystery <- read_csv('https://gist.githubusercontent.com/benwhalley/2103941f0c2ac0449247f59fe9c976b6/raw/4630367728610f0a9937d71eeac6fb511065ee46/mystery.csv')
mystery %>% glimpseRows: 1,846
Columns: 3
$ category [3m[38;5;246m<chr>[39m[23m "D", "D", "D", "D", "D", "D", "D", "D", "D", "D", "D", "D", "D", "D", "D", "D",…
$ x [3m[38;5;246m<dbl>[39m[23m 55.3846, 51.5385, 46.1538, 42.8205, 40.7692, 38.7179, 35.6410, 33.0769, 28.9744…
$ y [3m[38;5;246m<dbl>[39m[23m 97.1795, 96.0256, 94.4872, 91.4103, 88.3333, 84.8718, 79.8718, 77.5641, 74.4872…
Scenario 2: Admissions tutor
In the 1970s the University of California, Berkley, was concerned about the fairness of their admissions procedures. They collected data from across the university for a number of years, recording the:
- Number of applicants
- The department the student applied to
- The students’ gender
- Number of students accepted
- The percentage students of each gender who were accepted in each department
A summary of these data are available at this link: Berkley applications data
Your job is to:
- Describe the pattern of applications
- Decide if the university was fair in it’s admissions procedures
- Prepare a short presentation for the university governors which includes plots
Techniques/commands you might want to use:
filtergroup_byandsummarisestat_summaryfacet_wrap()orfacet_grid()
berkley <- read_csv('https://raw.githubusercontent.com/benwhalley/datafluency/master/website/data/berkley.csv')
berkley %>% glimpseRows: 72
Columns: 6
$ year [3m[38;5;246m<dbl>[39m[23m 1976, 1976, 1976, 1976, 1976, 1976, 1976, 1976, 1976, 1976, 1976, 1976,…
$ gender [3m[38;5;246m<chr>[39m[23m "Female", "Female", "Female", "Female", "Female", "Female", "Male", "Ma…
$ department [3m[38;5;246m<chr>[39m[23m "A", "B", "C", "D", "E", "F", "A", "B", "C", "D", "E", "F", "A", "B", "…
$ n.applicants [3m[38;5;246m<dbl>[39m[23m 133, 34, 785, 462, 511, 444, 1095, 740, 416, 536, 234, 485, 145, 33, 76…
$ n.admitted [3m[38;5;246m<dbl>[39m[23m 107, 20, 314, 146, 149, 318, 725, 455, 141, 205, 62, 297, 108, 21, 256,…
$ percent.admitted [3m[38;5;246m<dbl>[39m[23m 80.45, 58.82, 40.00, 31.60, 29.16, 71.62, 66.21, 61.49, 33.89, 38.25, 2…
This plot suggests a paradox: Although more men applied and were admitted, the proportion of male applicants who were successful was actually lower than for women:
Breaking the data down by-department reveals what is happening:
Note that you don’t (yet) know how to produce the plot above. We will cover this in the next sessions, but for now you could make two individual plots, like so:
This is an example of Simpson’s paradox. You can read more about the case here: http://corysimon.github.io/articles/simpsons-paradox/